Estimating Document Similarity using Auxiliary Category Information

Author

  • Gerhard Paass
Abstract

We have developed a novel approach to determining the similarity of documents using probabilistic latent semantic indexing. For each document, a probability vector of latent factors is estimated that takes into account both the distribution of words in the text and the distribution of category values. The emphasis can be shifted freely between the two aspects, so the method allows one to select a similarity measure that is better suited to the domain. In a preliminary evaluation, we determined the articles that are most similar to a target document. In spite of the small training set, the groups of similar articles exhibit a remarkable coherence.
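
The abstract gives no formulas, so the following is only a rough sketch of how such a two-view latent factor model might look, not the author's actual algorithm. It assumes a PLSA-style EM in which word counts and category counts are weighted by a hypothetical emphasis parameter `alpha` (0 = words only, 1 = category values only), and it assumes cosine similarity between the resulting factor vectors P(z|d). All function names, the weighting scheme, and the toy data are illustrative assumptions.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so they sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def fit_factors(word_counts, cat_counts, n_factors, alpha, n_iter=50, seed=0):
    """EM for a weighted two-view latent factor model (illustrative only).

    word_counts[d][w] and cat_counts[d][c] are count matrices (nested lists).
    alpha is an assumed emphasis weight in [0, 1] on the category view.
    Returns the factor vector P(z|d) for every document d.
    """
    rng = random.Random(seed)
    n_docs = len(word_counts)
    n_words, n_cats = len(word_counts[0]), len(cat_counts[0])

    def rand_dist(n):
        return normalize([rng.random() + 0.1 for _ in range(n)])

    p_z_d = [rand_dist(n_factors) for _ in range(n_docs)]
    p_w_z = [rand_dist(n_words) for _ in range(n_factors)]
    p_c_z = [rand_dist(n_cats) for _ in range(n_factors)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_factors for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_factors)]
        new_c_z = [[0.0] * n_cats for _ in range(n_factors)]
        for d in range(n_docs):
            views = (
                (word_counts[d], p_w_z, new_w_z, 1.0 - alpha),
                (cat_counts[d], p_c_z, new_c_z, alpha),
            )
            for counts, p_x_z, new_x_z, weight in views:
                for x, n in enumerate(counts):
                    if n == 0 or weight == 0.0:
                        continue
                    # E-step: posterior over factors for this (document, item) pair.
                    post = normalize(
                        [p_z_d[d][z] * p_x_z[z][x] for z in range(n_factors)]
                    )
                    # Accumulate weighted expected counts for the M-step.
                    for z in range(n_factors):
                        new_x_z[z][x] += weight * n * post[z]
                        new_z_d[d][z] += weight * n * post[z]
        # M-step: renormalize the expected counts into probabilities.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
        p_c_z = [normalize(row) for row in new_c_z]
    return p_z_d


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def most_similar(p_z_d, target, k=3):
    """Indices of the k documents whose factor vectors are closest to target's."""
    scores = [
        (cosine(p_z_d[target], vec), d)
        for d, vec in enumerate(p_z_d)
        if d != target
    ]
    return [d for _, d in sorted(scores, reverse=True)[:k]]


# Toy data (made up): 4 documents, 4 words, 2 category values.
toy_words = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 0, 4, 5]]
toy_cats = [[1, 0], [1, 0], [0, 1], [0, 1]]
factors = fit_factors(toy_words, toy_cats, n_factors=2, alpha=0.5)
print(most_similar(factors, target=0, k=2))
```

Varying `alpha` between 0 and 1 in this sketch moves the factor vectors from being driven purely by word distributions to being driven purely by category values, which is one plausible reading of the "freely shifted" emphasis described above.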

Similar resources

Determining Word Sense Dominance Using a Thesaurus

The degree of dominance of a sense of a word is the proportion of occurrences of that sense in text. We propose four new methods to accurately determine word sense dominance using raw text and a published thesaurus. Unlike the McCarthy et al. (2004) system, these methods can be used on relatively small target texts, without the need for a similarly-sense-distributed auxiliary text. We perform an...

A New Inductive Learning Method for Multilabel Text Categorization

In this paper, we present a new inductive learning method for multilabel text categorization. The proposed method uses a mutual information measure to select terms and constructs document descriptor vectors for each category based on these terms. These document descriptor vectors form a document descriptor matrix. It also uses the document descriptor vectors to construct a document-similarity m...

Similarity Model and Term Association for Document Categorization

This paper addresses similarity model and term association for similarity-based document categorization. Both Euclidean distance– and cosine-based similarity models are widely used for measures of document similarity in information retrieval and document categorization community. These two similarity models are based on the assumption that term vectors are orthogonal. Term associations are igno...

Grieser, Karl, Timothy Baldwin, Fabian Bohnert and Liz Sonenberg (2011) Using Ontological and Document Similarity to Estimate Museum Exhibit Relatedness, ACM Journal of Computing and Cultural Heritage 3(3), pp. 1-20

Exhibits within Cultural Heritage collections such as museums and art galleries are arranged by experts with intimate knowledge of the domain, but there may exist connections between individual exhibits that are not evident in this representation. For example, the visitors to such a space may have their own opinions on how exhibits relate to one another. In this paper, we explore the possibilit...

Oracle at TREC 10: Filtering and Question-Answering

Oracle’s objective in TREC-10 was to study the behavior of Oracle information retrieval in previously unexplored application areas. The software used was Oracle9i Text[1], Oracle’s full-text retrieval engine integrated with the Oracle relational database management system, and the Oracle PL/SQL procedural programming language. Runs were submitted in filtering and Q/A tracks. For the filtering t...

Publication date: 2003